Combining convolutional neural networks and self-attention for fundus diseases identification

Early detection of lesions is of great significance for treating fundus diseases. Fundus photography is an effective and convenient screening technique by which common fundus diseases can be detected. In this study, we use color fundus images to distinguish among multiple fundus diseases. Existing research on fundus disease classification has achieved some success through deep learning techniques, but models that use only deep convolutional neural network (CNN) architectures have limited global modeling ability, leaving much room for improvement in evaluation metrics, and the simultaneous diagnosis of multiple fundus diseases still faces great challenges. Therefore, given that the self-attention (SA) model with a global receptive field may have robust global-level feature modeling ability, we propose MBSaNet, a multistage fundus image classification model that combines CNN and the SA mechanism. The convolution block extracts the local information of the fundus image, and the SA module further captures the complex relationships between different spatial positions, thereby directly detecting one or more fundus diseases in a retinal fundus image. In the initial stage of feature extraction, we propose a multiscale feature fusion stem, which uses convolutional kernels of different scales to extract low-level features of the input image and fuses them to improve recognition accuracy. Training and testing were performed on the ODIR-5k dataset. The experimental results show that MBSaNet achieves state-of-the-art performance with fewer parameters. The wide range of diseases and the different fundus image collection conditions confirm the applicability of MBSaNet.


Related Work
Fundus disease identification method. Fundus photography is a common method for fundus disease examination. Compared with other examination methods such as fundus fluorescein angiography (FFA) and fundus optical coherence tomography (OCT), it has the advantages of low cost, fast detection speed, and simple image acquisition. In recent years, with the continuous advancement of CV and image processing technology, disease screening and identification methods based on fundus images have emerged. Considering the characteristics of image datasets, a shallow CNN 19 was designed for automatic detection of age-related macular degeneration (AMD); the average accuracy of ten-fold cross-validation was 95.45%, and the average accuracy of blindfold validation was 91.17%. Ref. 20 employed the Inception-v3 structure to diagnose diabetic retinopathy, trained it on 128,175 fundus images, and obtained good results on two validation datasets, demonstrating that deep learning can be applied to ophthalmic disease diagnosis. Based on EfficientNet, a model integration strategy was proposed 16 in which the color and gray versions of the same fundus image are input into two EfficientNets with the same architecture for training, and the outputs of the two models are then integrated to obtain the final output. Considering the possible correlation between the fundus images of both eyes of the same patient, a dense correlation network (DCNet) 21 was devised to aggregate related features based on the dense spatial correlation between paired fundus images. Several alternative backbone feature extraction networks were tested on the ODIR-5K dataset, and the results indicated that the DCNet module effectively improved the recognition accuracy of fundus diseases. To extract the deep features of the fundus images, 22 used the R-CNN+LSTM architecture. 
The classification accuracy was enhanced by 4.28% and 1.61%, respectively, by using the residual method and adding the LSTM model to the R-CNN+LSTM model. In terms of feature selection, the 350 deep features were subjected to a multi-level feature selection approach known as NCAR, which improved accuracy and reduced the computation time of the support vector machine (SVM) classifier. For the detection of glaucoma, diabetic retinopathy and cataracts from fundus images, three pipelines 23 were built in which twelve deep learning models and eight SVM classifiers were trained, using different pretrained models such as Inception-v3, AlexNet, VGGNet and ResNet. The experimental results showed that the Inception-v3 model had the best performance, with an accuracy of 99.30% and an F1-score of 99.39%. Ref. 24 employed transfer learning to classify diabetic retinopathy fundus images. Experiments on the DR1 and MESSIDOR public datasets indicated that knowledge learned on large datasets (source domain) could improve classification on small datasets (target domain) via transfer learning. Ref. 25 developed an enhanced residual dense block CNN that could effectively classify fundus images into "good quality" and "low quality", addressing the problem of fundus image quality classification and helping to avoid delays in patient treatment. Ref. 26 offered a six-level cataract grading method that focuses on multifeature fusion and extracts features from the residual network (ResNet-18) and the gray-level co-occurrence matrix (GLCM), with promising results.
Transformer architecture. Transformer 8 is an attention-based encoder-decoder architecture that has revolutionized the field of natural language processing. Recently, inspired by this major achievement, several pioneering studies have been carried out in the computer vision (CV) field, demonstrating the effectiveness of Transformers in various CV tasks. With competitive modeling capabilities, ViTs achieve impressive results on multiple benchmarks such as ImageNet, COCO, and ADE20k, compared with existing CNNs. As the pioneering work on Transformers in the CV field, the visual Transformer (ViT) 11 structure achieves excellent performance on ImageNet. However, a limitation of ViT is the requirement for large-scale datasets, such as ImageNet-21k 12 and JFT-300M 12 (a private dataset), to obtain pretrained models. Although SA modules are able to improve recognition accuracy, they usually bring extra computation and are hence frequently seen as add-ons to CNNs, similar to the squeeze-and-excitation (SE) 27 modules. By contrast, following the success of ViT, a novel research direction has emerged that starts from the Transformer backbone and incorporates explicit convolutions or other desirable convolutional properties. For example, a layer-by-layer Tokens-To-Token (T2T) transformation 28 was developed to gradually convert images into tokens and capture local structural information. The same work further provided a T2T-ViT backbone with a deep-narrow architecture, which somewhat alleviated ViT's reliance on large-scale datasets. Ref. 15 proposed the Swin Transformer, which, beyond image classification, achieves state-of-the-art results in various CV tasks such as object detection and semantic segmentation. 
Based on the Swin Transformer, and to overcome the intrinsic locality limitations of convolutional operations, 29 recently proposed SwinE-Net, which effectively improved the robustness and accuracy of polyp segmentation by combining EfficientNet and Swin Transformer to maintain global semantics without sacrificing the low-level features of the CNN. Some researchers have proposed hybrid approaches that combine convolutional and SA modules in the same architecture instead of utilizing pure attention models. For example, the Convolutional Enhanced Image Transformer (CeiT) 30 uses a CNN to extract low-level features before using the Transformer to model long-range dependencies. BoTNet 31 incorporates the SA module into ResNet, allowing it to outperform ResNet in image classification and object detection tasks. Similarly, 18 presented CoAtNet, a simple yet effective network structure made up primarily of MBConv blocks 32 and Transformer blocks. In contrast to BoTNet, CoAtNet uses the MBConv block as the major component rather than the residual block, and the Transformer block is located in the last two stages rather than only the final stage. With this design, CoAtNet can achieve the good generalization of CNNs and the superior model capacity of Transformers. In addition, 33 introduced the CNNs Meet Transformers (CMT) block, and 34 proposed the convolutional ViT (CvT) architecture, which integrates convolutional layers with Transformers into a single block. The CMT and CvT designs, like ResNet 5 , contain multiple stages for generating feature maps of various sizes, each of which is made up of CMT/CvT blocks.

Results
This section presents the experimental results obtained on the ODIR-5K dataset, comparing the proposed MBSaNet against different baselines. Implementation details. All experiments were performed on a dedicated server with an Intel Xeon Gold 6226R CPU (16 cores, 32 threads), 32 GB of memory, and an NVIDIA RTX5000 GPU with 16 GB of GPU memory. To verify the effectiveness of the proposed model, we designed multiple sets of comparative experiments. We use the data-augmented original dataset for training and, for testing, an off-site test set of 1,000 images, an on-site test set of 2,000 images, and a balanced test set of 400 images. The hyperparameter settings are shown in Table 1.
Comparison experiment with CNNs and other hybrid models. Owing to the robust feature learning ability of CNNs, which avoids the tedious manual feature design of traditional methods, CNNs have been the main model architecture for CV since the breakthrough of AlexNet 39 . In recent years, newly proposed CNN architectures have attained state-of-the-art performance in tasks such as image classification and object detection. For performance testing, we compared MBSaNet with mainstream CNN backbone models on three independent test sets. The results showed that in the off-site test set, MBSaNet can achieve an AUC value of 0.891, a Kappa value of 0.438, an F1-score of 0.881, and a final score of   Figure 1.
Hybrid models based on CNN and Transformer have achieved state-of-the-art performance on large-scale datasets such as ImageNet, but they have not yet been applied in the field of fundus disease recognition, where image data are scarce. To evaluate their performance and compare them with MBSaNet, we conduct experiments with two CoaT 38 models, two different configurations from the CoAtNet family 18 , and BoTNet50 31 . To ensure fairness, we apply the parameter settings in Table 1 to all models and use the same data-augmented training set. The experimental results are shown in Tables 2, 3 and 4.
Comparison with previous work. In this subsection, the advanced nature of MBSaNet is verified by comparing it with several previous studies. Among them, 16 proposed a model integration strategy, inputting the color and gray versions of the same fundus image into two EfficientNets with the same architecture for training, and integrating the output results of the two models to obtain the final output. Ref. 40 used the Inception-v3 35 model, replacing the network's randomly generated weight parameters at the start of training with weight parameters that had been previously trained on ImageNet, and used the data-augmented image dataset for training. The experimental results on the off-site test set containing 1,000 fundus images are shown in Table 5.
Ablation study. In this section, we investigate the effects of using various stacking schemes in the stem stage, and the performance impact of using global SA in the final stage of our multistage feature extraction network. The same settings as in Table 1 are used for a fair comparison. We compare the performance of six different schemes of vertically stacked convolutions and horizontally stacked convolutions on the off-site test set. The specific combinations of the schemes are described in Table 6. The experimental results are shown in Figure 2 and demonstrate that stacking convolutions horizontally to widen the stem structure is more efficient than stacking convolutions vertically. We also observe a drop in metrics when the multiscale feature fusion stem (MFFS) is replaced with a single-scale feature fusion stem, which indicates that using convolution kernels of different scales is more conducive to extracting high-quality features. By introducing MBNet, a variant of MBSaNet that uses only improved MBConv blocks, we verify the effectiveness of the global SA module. Based on the feature maps extracted in the convolution stage, a two-layer SA module is utilized in the final stage to further capture long-term dependencies, which significantly improves the feature modeling ability.

Discussion
We introduced MBSaNet, a novel model based on the SA mechanism for fundus image classification, which is the first application of the Transformer architecture in the field of fundus multidisease recognition and hence provides a new idea for research on SA models in medical image processing. The experimental results showed that, compared with many popular backbone networks, MBSaNet has higher accuracy in the recognition of multiple fundus diseases. The wide range of image sources and the large intra-category variations introduced by different camera acquisitions demonstrate the robust feature extraction capabilities of MBSaNet, indicating its great potential in assisting ophthalmologists in clinical diagnosis, especially in the identification of glaucoma, cataract, AMD, hypertensive retinopathy and myopic retinopathy. Figure 3 shows MBSaNet prediction results on some sample images from the test set.
By explicitly combining convolutional layers and SA layers in a multistage network, the model achieves a good balance between generalization performance and global feature modeling ability: while generalizing well on smaller datasets, it can also extract high-quality semantic features from fundus images for decision-making by the fully connected layers. The experimental results show that, compared with the convolutional networks, MBSaNet achieved better performance with fewer parameters. Its Kappa value was 5 percentage points higher than that of the best performing CNN model, indicating that MBSaNet's predictions are more consistent with the actual classification results and that the model is less biased toward particular categories, which matters on imbalanced datasets. In contrast, the accuracy metric is less relevant because there is a huge imbalance in the sample size of each category, and a model can obtain high accuracy by directly assigning test samples to a category with a large sample size.
We also compared MBSaNet with other hybrid models, and MBSaNet shows clear advantages over them. The poor performance of the other hybrid models on the fundus dataset can mainly be attributed to insufficient generalization on the ODIR-5K dataset, even though we employed data augmentation techniques. Although MBSaNet has a certain similarity with the CoAtNet models, there is a large gap in the final score. We believe that this is mainly related to the use of SA modules in the last two stages of feature extraction in CoAtNet. In every CoAtNet configuration, the penultimate stage stacks the most modules and accounts for the most computation, and building it from SA modules, which lack inductive bias, reduces the generalization performance of the model on smaller datasets. In addition, the number of hidden dimensions at each stage also affects the performance. In the experimental comparison with previous studies, on the three important metrics, AUC, Kappa, and F1-score, MBSaNet is behind only on Kappa, where the model of 16 scores higher. Notably, the AUC value of MBSaNet far exceeds those of the other models. Considering that ODIR is an unbalanced dataset and that AUC is not sensitive to class imbalance, this indicates that MBSaNet is a better-suited model for classification of multiple fundus diseases.
According to the prediction results of several models on the balanced test set, the recognition accuracy for images with label O is generally poor, mainly because the label covers images of too many different categories, which makes the intra-class gap too large for the model to partition them effectively. In the ablation experiments, the networks with horizontally widened stems perform better, and the network with MFFS achieves the best performance. This shows that the simple structure is effective: extracting image features at different scales and fusing them at the initial stage helps improve the classification performance. In addition, compared with the MBNet variant, MBSaNet performs better on all classification indicators, which indicates that introducing the global receptive field and enhancing the global modeling ability of the model allows the pathological features of different lesions in the fundus image to be extracted more effectively.
Due to the use of different camera equipment under different environmental conditions, the fundus images used in this study have high diversity. Hence, we adopted certain image preprocessing methods to enhance the contrast of the image features and expand the training dataset, while preserving the original image features as much as possible. Both raw and processed images are fed into the model for training, which can provide useful features for the identification of multiple fundus diseases. Some limitations of this study are as follows: (1) The limited number of images in some categories may affect the performance of the model, although highly diverse fundus images are used. (2) The distributions of categories in the on-site and off-site test datasets are unbalanced, and it is difficult to assess the classification accuracy of the model for a specific disease. (3) We eliminated a few images that were marked as low image quality; however, such images are unavoidable in practical situations. (4) It has been reported that the effect of increasing the number of fully connected layers of a neural network depends on the type of dataset being used 42 . In our experiments, we found that in the convolution stage, the number of hidden dimensions also has a great impact on the recognition accuracy of fundus diseases, which is worth further study.

Methods
MBSaNet is proposed to improve the performance of classification models on the task of automatic recognition of multilabel fundus diseases. The main idea of MBSaNet is the explicit combination of convolutional layers and SA layers, which gives the model both the generalization ability of CNNs and the global feature modeling ability of Transformers 18,43 . Previous studies have demonstrated that the local prior of the convolutional layer makes it good at extracting local features from fundus images; however, we believe that long-term dependencies and the global receptive field are also essential for fundus disease identification, because even an experienced ophthalmologist is unable to make an accurate diagnosis from a small part of a fundus image (e.g., only the macula). Considering that the SA layer with global modeling ability can capture long-term dependencies, MBSaNet adopts a building strategy similar to the CoAtNet 18 architecture, with vertically stacked convolutional blocks and self-attention modules. The overall framework of MBSaNet is shown in Figure 4, and Table 7 shows the size of the input and output feature maps at each stage of the model. The framework comprises two parts. The first is a feature extractor with five stages, Stage0-Stage4, where Stage0 is our proposed multiscale feature fusion stem (MFFS), Stage1-Stage3 are all convolutional layers, and Stage4 is an SA layer with relative position representations. The second part is a multilabel classifier that predicts the sample category based on the features extracted by the first part. We use the MBConv block, which includes residual connections and an SE block 27 , as the basic building block in all convolutional stages because it shares the inverted bottleneck design of the feed-forward network (FFN) block of Transformers. 
Unlike the regular MBConv block, MBSaNet replaces the max-pooling layers in the shortcut branch with stride-2 convolutional layers for downsampling. Because MBSaNet is a custom neural network, it is trained from scratch.
Dataset. The dataset was obtained from the "International Competition on Ocular Disease Intelligent Recognition" sponsored by Peking University. It contains "real" patient data collected from different hospitals and medical centers in China and was jointly launched by the Nankai University School of Computer Science-Beijing Shanggong Medical Information Technology Co., Ltd. joint laboratory. The training set is a structured ophthalmology database that includes the ages of 3,500 patients, color fundus images of their left and right eyes, and diagnostic keywords from clinicians. The test set includes an off-site test set and an on-site test set, but as with the training set, the number of samples under each category is unbalanced. Therefore, we also constructed a balanced test set with 50 images per class by randomly sampling a total of 400 images from the training set. The specific details of the dataset can be found in Table 8. Fundus images were recorded by various cameras, including Canon, Zeiss, and Kowa, with variable image resolutions. As illustrated in Figure 5, there are two points to note. First, a patient may have one or more labels, as shown in Figure 5(b); that is, the task is a multidisease multilabel image classification task. Second, as shown in Figure 5(c), the class labeled Other Diseases/Abnormalities (O) contains images related to more than 10 different diseases, as well as low-quality images caused by factors such as lens blemishes and invisible optic discs, which greatly increases the variability within this class. All the methods developed and experiments were carried out in accordance with the relevant guidelines and regulations associated with this publicly available dataset.
Evaluation metrics. Accuracy is the proportion of correctly classified samples among all samples and is the most basic evaluation indicator in classification problems. Precision is the probability that the true label of a sample is positive among all samples predicted to be positive. Recall is the probability that a sample is predicted to be positive among all samples with positive labels. Given the specificity of the task, we use micro-averaged precision and recall over the categories in our experiments. AUC is the area under the ROC curve; the closer the value is to 1, the better the classification performance of the model. AUC is often used to measure model stability. The Kappa coefficient is another index calculated from the confusion matrix; it measures the classification accuracy of the model and can also be used for consistency testing. It is computed as κ = (p0 − pe) / (1 − pe), where p0 denotes the sum of the diagonal elements divided by the sum of all matrix elements, i.e., accuracy, and pe denotes the sum over all categories of the products of the actual and predicted counts, divided by the square of the total number of samples. F1_score, also known as the balanced score, is the harmonic mean of precision and recall. Given the category imbalance in the dataset, we use micro-averaging to calculate metrics globally by counting the total true positives, false negatives and false positives. The closer the value is to 1, the better the classification performance of the model. Final_score is the average of F1_score, Kappa, and AUC.
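The metric definitions above can be illustrated with a short sketch. It is a minimal pure-Python version written for this text, not the evaluation script used in the study; the function names and the small example inputs are assumptions chosen for illustration.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multilabel 0/1 rows (one row per sample)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(confusion):
    """Kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # p0: observed agreement, i.e. accuracy.
    p0 = sum(confusion[i][i] for i in range(n)) / total
    # pe: chance agreement from the row (actual) and column (predicted) totals.
    pe = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / total ** 2
    return (p0 - pe) / (1 - pe)


def final_score(f1, kappa, auc):
    """Average of F1, Kappa and AUC, as defined in the text."""
    return (f1 + kappa + auc) / 3
```

As a usage illustration, plugging the off-site values quoted in the Results (F1 0.881, Kappa 0.438, AUC 0.891) into `final_score` gives roughly 0.737. That figure is derived here from the stated definition and is not quoted from the paper.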
Data preprocessing. The fundus image dataset contains some low-quality images, which are removed since they would not be helpful for training. To minimize the interference with feature extraction caused by the extra noise from the black area of the fundus images, the redundant black area is cropped. We use the OpenCV library to load the image as a pixel array and use the edge position coordinates of the retinal region of the fundus image to remove the black edges. After cropping, the fundus images are resized to 224×224, as shown in Figure 6. Data augmentation is the artificial generation of different versions of a real dataset to increase its size; the images after data augmentation are shown in Figure 7. Because the dataset must be expanded while retaining the main features of the original images, we use operations such as random rotation by 90°, adjustment of contrast, and center cropping. Finally, global histogram equalization is performed on the original and augmented images, so that the contrast of the images is higher and the gray value distribution is more uniform.
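The cropping, rotation and equalization steps can be sketched as follows. This is a dependency-free illustration on small grayscale images stored as lists of rows, not the OpenCV pipeline used in the study; the function names and the brightness `threshold` are assumptions, and resizing to 224×224 is omitted.

```python
def crop_black_border(image, threshold=10):
    """Crop to the tightest box containing pixels brighter than `threshold`.

    `threshold` is a hypothetical cut-off; the paper does not state one.
    """
    rows = [r for r, row in enumerate(image) if any(p > threshold for p in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > threshold for row in image)]
    if not rows or not cols:
        return [row[:] for row in image]  # nothing bright enough: keep as is
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def rotate_90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*image[::-1])]


def equalize_histogram(image, levels=256):
    """Global histogram equalization via the cumulative distribution."""
    flat = [p for row in image for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    return [[round((cdf[p] - cdf_min) * (levels - 1) / (n - cdf_min))
             for p in row] for row in image]
```

On real images the same steps would normally be done on arrays with a library such as OpenCV, which provides `cv2.equalizeHist` for single-channel 8-bit images.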
Multiscale feature fusion stem. The predictive ability of a classifier is closely related to its ability to extract high-quality features. In fundus multidisease identification, the lesions of several common eye diseases appear differently in fundus images, so the lesion areas vary in size and distribution. We propose a feature fusion module with convolution kernels of different sizes to extract multiscale primary features of images in the input stage of the network and fuse them in the channel dimension. Feature extractors with convolution kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 are used. Since the convolution stride is set to 2, we pad the input image before performing each convolution operation to ensure that the output feature maps are the same size. By employing convolution kernels with different receptive fields in the horizontal direction to broaden the stem structure, more locally or globally biased features are extracted from the original images. Batch normalization and ReLU activation are then applied to each branch separately, and the resulting feature maps are concatenated. The experimental results show that by widening the stem structure in the horizontal direction, higher quality low-level image features can be obtained at the primary stage.
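The padding requirement in the stem can be checked with simple arithmetic. The sketch below assumes symmetric padding of (k − 1) / 2 pixels per side for a kernel of size k, which is one common way to make the branches agree; the paper does not state the padding values, and the channel count per branch used in the usage note is hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def branch_padding(kernel):
    """Assumed per-side padding for an odd kernel size."""
    return (kernel - 1) // 2


def stem_branch_sizes(input_size=224, kernels=(3, 5, 7, 9), stride=2):
    """Output size of each stem branch for the kernel sizes named in the text."""
    return {
        k: conv_output_size(input_size, k, stride, branch_padding(k))
        for k in kernels
    }


def concatenated_channels(channels_per_branch, num_branches=4):
    """Channel count after concatenating the branches along the channel axis."""
    return channels_per_branch * num_branches
```

With a 224 × 224 input, `stem_branch_sizes()` gives 112 for every kernel size, so the four feature maps can be concatenated directly; with, say, 16 channels per branch the fused map would have 64 channels.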
Multistage feature extractor. CNNs  The convolution operation on the shortcut branch, with a kernel size of 2 × 2 and a stride of 2, makes the shortcut output feature map match the output size of the residual branch. The experimental results show that this slightly improves the performance. The convolutional building blocks we use are shown in Figure 8, and the downsampling implementation can be expressed as Formula 8.
where x_i, y_i ∈ R^D denote the input and output at position i, respectively, and L(i) denotes a local neighborhood of i, e.g., a 3 × 3 grid centered at i in image processing.
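The expression that this sentence describes does not appear in the text as it stands. In the CoAtNet formulation (18) that this section follows, depthwise convolution over a local neighborhood is written as below; it is given here as a reconstruction from that reference, with w_{i−j} the kernel weight for the offset between positions i and j, and may differ in notation from the authors' own formula.

```latex
y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j
```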
In natural language processing and speech understanding, the Transformer design, of which the SA module is a crucial component, has been widely used. SA extends the receptive field to all spatial positions and computes weights based on the re-normalized pairwise similarity between the pair (x_i, x_j), as shown in Formula 9, where G indicates the global spatial space. Early research on stand-alone SA networks 33 showed that diverse CV tasks can be performed satisfactorily using SA modules alone, albeit with some practical limitations. After pretraining on the large-scale JFT dataset, ViT 11 applied the vanilla Transformer to ImageNet classification and produced outstanding results. However, with insufficient training data, ViT still trails well behind SOTA CNNs. This is mainly because typical Transformer architectures lack the translation equivariance 18 of CNNs, which improves generalization on small datasets 46 . Therefore, we adopt a method similar to CoAtNet: the global static convolution kernel is summed with the adaptive attention matrix before softmax normalization, which can be expressed as Formula 10, where (i, j) denotes any position pair and w_{i−j} denotes the corresponding convolution weights. This improves the generalization ability of the Transformer-based network by introducing the inductive bias of CNNs. The receptive field size is one of the most critical differences between SA and convolutional modules. In general, a larger receptive field provides more contextual information, but this usually results in higher model capacity. The global receptive field has been a key motivation for employing SA mechanisms in vision. However, a larger receptive field requires more computation. For global attention, the complexity is quadratic w.r.t. spatial size. 
Therefore, in the process of designing the feature extraction backbone, considering the huge computational overhead brought by the Transformer structure and the small amount of training data for practical tasks, we use more convolution blocks, and only set up two layers of SA modules in Stage4 in the feature extraction stage. Experimental results show that this achieves a good balance between generalization performance and feature modeling ability.
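The idea of adding a static, offset-dependent convolution weight to the attention logits before the softmax can be sketched in a few lines. The example below is a simplified one-dimensional, single-head illustration in pure Python, assuming scalar biases indexed by the offset i − j; it omits the learned query, key and value projections and the two-dimensional layout of a real SA layer, and the function name is hypothetical.

```python
import math


def relative_attention_1d(x, w):
    """Pre-softmax relative attention over a 1-D sequence of token vectors.

    x: list of n token vectors (lists of floats of equal length).
    w: dict mapping an offset (i - j) to a scalar bias; missing offsets are 0.
    Returns y with y[i] = sum_j softmax_j(x[i] . x[j] + w[i - j]) * x[j].
    """
    n, dim = len(x), len(x[0])
    outputs = []
    for i in range(n):
        # Content similarity plus the static, translation-invariant bias.
        logits = [
            sum(a * b for a, b in zip(x[i], x[j])) + w.get(i - j, 0.0)
            for j in range(n)
        ]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        outputs.append(
            [sum(weights[j] * x[j][d] for j in range(n)) for d in range(dim)]
        )
    return outputs
```

Every output position attends to all n positions, so the cost grows quadratically with the sequence length, which is consistent with restricting SA to the low-resolution final stage.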
Multilabel loss function. The fundus disease recognition task is a multilabel classification problem, so traditional single-label loss functions are unsuitable for training. We follow the loss function used in 16,40 . All classified images can be represented as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i is associated with the ground truth label y_i, i = 1, ..., N, and N is the number of samples. We wish to find a classification function F : X → Y that minimizes the loss function L. We use N sets of labeled training data (x_i, y_i) and encode each y_i with a one-hot method as y_i = [y_i^1, y_i^2, ..., y_i^8], so each y_i contains 8 values, corresponding to the 8 categories in the dataset. We draw on the traditional multilabel classification method based on problem transformation and transform the multilabel classification problem into a two-class classification problem for each label. The final loss is the average of the loss values of the samples corresponding to each label. After studying weighted loss functions, such as sample balance and class balance, we decided to use weighted binary cross-entropy from Formula 11 as the loss function, where W = (1, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2) denotes the loss weight. The positive class is 1, and the negative class is 0. p(y_i) is the probability that sample i is predicted to be positive.
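A weighted binary cross-entropy of this kind can be sketched as below. The exact form of Formula 11 is not reproduced in this text, so the sketch assumes that each weight in W scales the binary cross-entropy term of one label and that the result is averaged over all sample-label pairs; the function name and the clipping constant `eps` are illustrative choices.

```python
import math

# Loss weights W as listed in the text, one per label.
CLASS_WEIGHTS = (1, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2)


def weighted_bce(y_true, y_prob, weights=CLASS_WEIGHTS, eps=1e-7):
    """Mean weighted binary cross-entropy over all samples and labels.

    y_true: rows of 0/1 labels, one value per label.
    y_prob: rows of predicted probabilities with the same shape.
    """
    total, count = 0.0, 0
    for labels, probs in zip(y_true, y_prob):
        for y, p, w in zip(labels, probs, weights):
            p = min(max(p, eps), 1 - eps)  # keep log() finite
            total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count
```

Under this assumed form, a label with weight 1.5 contributes 50% more to the loss than a label with weight 1 for the same prediction error, which pushes training toward the more heavily weighted categories.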
After obtaining the loss function, we need to choose an appropriate optimization function to optimize the learning parameters. Different optimizers have different effects on parameter training, so we mainly consider